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Sequential Design for Ranking Response Surfaces 

Ruimeng Hu and Mike Ludkovski * 


Abstract. Motivated by the problem of estimating optimal feedback policy maps in stochastic control applications, 
we propose and analyze sequential design methods for ranking several response surfaces. Namely, given 
L > 2 response surfaces over a continuous input space X, the aim is to efficiently find the index of 
the minimal response across the entire X. The response surfaces are not known and have to be noisily 
sampled one-at-a-time, requiring joint experimental design both in space and response-index dimensions. 
To generate sequential design heuristics we investigate Bayesian stepwise uncertainty reduction approaches, 
as well as sampling based on posterior classification complexity. We also make connections between our 
continuous-input formulation and the discrete framework of pure regret in multi-armed bandits. To model 
the response surfaces we utilize kriging metamodels. Several numerical examples using both synthetic data 
and an epidemics control problem are provided to illustrate our approach and the efficacy of respective 
adaptive designs. 

Key words, sequential design, response surface modeling, stochastic kriging, sequential uncertainty reduction, ex¬ 
pected improvement 


1. Introduction. A central step in stochastic control problems concerns estimating expected 
costs-to-go that are used to approximate the optimal feedback control. In simulation approaches to 
this question, costs-to-go are sampled by generating trajectories of the stochastic system and then 
regressed against current system state. The resulting Q-values are hnally ranked to find the action 
that minimizes expected costs. 

When simulation is expensive, computational efficiency and experimental design become im¬ 
portant. Sequential strategies rephrase learning the costs-to-go as another dynamic program, with 
actions corresponding to the sampling decisions. In this article, we explore a Bayesian formulation 
of this sequential design problem. The ranking objective imposes a novel loss function which mixes 
classification and regression criteria. Moreover, the presence of multiple stochastic samplers (one 
for each possible action) and a continuous input space necessitates development of targeted re¬ 
sponse surface methodologies. In particular, a major innovation is modeling in parallel the spatial 
correlation within each Q-value, while utilizing a multi-armed bandit perspective for picking which 
sampler to call next. 

To obtain a tractable approximation of the Q-values, we advocate the use of Gaussian pro¬ 
cess metamodels, viewing the latent response surfaces as realizations of a Gaussian random field. 
Gonsequently, the ranking criterion is formulated in terms of the posterior uncertainty about each 
Q-value. Thus, we connect metamodel uncertainty to the sampling decisions, akin to the discrete- 
state frameworks of ranking-and-selection and multi-armed bandits. Our work brings forth a new 
link between emulation of stochastic simulators and stochastic control, offering a new class of 
approximate dynamic programming algorithms. 

1.1. Abstract Ranking Problem. Let : A —M, £ G £ = {1, 2,..., L} be L smooth functions 
over a subset X of We are interested in the problem of learning the resulting ranking of pLi 


^Department of Statistics and Applied Probability University of California, Santa Barbara 93106-3110 
hu@pstat.ucsb.edu,ludkovski@pstat.ucsb.edu. Work partially supported by NSF ATD-1222262. 

1 




2 


Ruimeng Hu and Michael Ludkovski 


over the input space X, namely finding the classifier 

(1.1) C(x) := argmm{//£(x)} G £. 

The functions are a priori unknown but can be noisily sampled. That is for any x £ X^ H 
we have access to a simulator Y£{x) which generates estimates of fJ,i{x): 

(1.2) Ye{x) = ne{x) + ee{x), i £ £, 

where ei are independent, mean zero random variables with variance a‘^{x). Intuitively speaking, 
we have L smooth hyper-surfaces on X that can be sampled via Monte Carlo. In the dynamic 
programming context, x is the system state, i indexes the various actions available to the con¬ 
troller, represents the expected costs-to-go and €£{■) captures the simulation noise arising 

from pathwise simulation of the underlying stochastic system and corresponding costs. 

Our goal is to identify the minimal surface globally over the entire input space. More precisely, 
we seek to assign at each x £ X a label C{x), while optimizing the loss metric 

(1.3) £(C,C) := ^ |/i^'(^)(x)-/xc(,j,)(x)| F{dx), 

where F{-) is a specified weight function on X determining the relative importance of ranking 
different regions. Thus, the loss is zero if the ranking is correct C(x) = C{x), and otherwise 
is proportional to the (positive) difference between the selected response and the true minimum 
fJ-c ~ l^c- The above criterion aims to identify the optimal action i*{x) = C{x) to take in state x; 
if the wrong action C{x) is chosen instead, then (1.3) captures the resulting integrated loss to the 
controller assuming a probability distribution F{-) of potential states x. 

The loss function in (1.3) blends regression and classification objectives. In regression, one 
seeks to estimate the response marginally with the loss function tied to a single surface In¬ 

stead, (1.3) is only about correctly identifying the index of the minimal response. As a result, small 
estimation errors are tolerated as long as the minimal response does not change, leading to a thresh¬ 
olding behavior in the loss function. In classification the loss function is discrete (typically with 
fixed mis-classification penalties), whereas (1.3) takes losses proportional to the mis-classification 
distance ~ l^C{x){x). A further key distinction is that in classification the sampling space is 

just X (returning a noisy label C{x) £ £), whereas in our context a sampling query consists of the 
location-index pair {x,i) G A x £, sampling one response at a time. The question of which surface 
to sample requires separate analysis over £. 

We analyze the design problem of constructing efficient sampling strategies that can well- 
estimate C(-) while optimizing the number of Monte Carlo samples needed. Because 
unknown, we frame (1.3) as a Bayesian sequential learning problem of adaptively growing a design 
Z that quickly learns C{x). Classical static, i.e. response-independent, designs are inadequate for 
ranking since the whole essence of optimizing computational efforts is predicated on learning the 
structure of the unknown /x^’s. Intuitively, learning manifests itself in focusing the sampling through 
discriminating both in the input space X (focus on regions where identifying C{x) is difficult) and 
in the sampling indices T (focus on the surfaces where /x^ is likely to be the smallest response). 

Due to the joint design space X x C, our problem allows for a dual interpretation. Fixing 
i, (1.1) is about reconstructing an unknown response surface x i—)■ fieix) through noisy samples. 
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Aggregating the different response surfaces, sequential design over X reduces to identifying the 
partitions of A = into the sets 

(1.4) Ci := {x : C(x) = i} = {x : fj.c(x)(x) = min= /u.i(x)}, i = l,...,L. 

Because in interiors of the partitions Cj the ranking C(x) is easier to identify, the main problem 
is to find the partition boundaries dCi. As a result, (1.1) is related to contour-finding, for which 
sequential design was studied in [22, 42, 43]. Standard contour-finding attempts to identify the level 
set {fii{x) = a} of the response surface, which corresponds to L = 2 with known fj, 2 {x) = o in 
(1.1). Hence, the analysis herein can be viewed as a multi-variate extension of contour finding. In 
turn, contour-finding generalizes the classical objective of minimizing a noisy response, connecting 
to the expected improvement/information gain trade-off in simulation optimization. In particular, 
we re-purpose the active learning rules of [14, 36]. 

Conversely, fixing x, the aim of determining the smallest response argmin^ |U£(x) corresponds 
to the setting of multi-armed bandits (MAB). The bandit has L arms and corresponding payoffs 
fii{x),i G £, with the decision-theoretic objective (1.1) known as the pure exploration problem 
[7, 8]. Decision policies for which arm to pull are usually expressed in terms of posterior mean 
and confidence about the respective payoff; this point of view motivates our use of Gap-Upper 
Confidence Bound (UCB) design strategies [4, 46]. Compared to this literature, (1.3) contains two 
key differences. First, the loss function is a weighted pure-regret criterion which to our knowledge 
has never been used in MAB context. Second, instead of a single bandit with independent arms, 
we treat the radical extension to a continuum of bandits indexed by x G X. Recently, [26, 17] 
considered multiple bandits which can be viewed as (1.1) with a discrete, non-metrized X. We 
generalize their setting to a continuous X, with a spatial correlation structure of the arms. 

1.2. Summary of Approach. To handle continuous state spaces x G X that appear in stochas¬ 
tic control, we adopt the framework of kriging or Gaussian process (GP) regression for modeling the 
Q-values. In both contexts of Design of Experiments (DoE) and continuous MAB’s, kriging models 
have emerged as perhaps the most popular framework [47]. In particular, kriging has been used ex¬ 
tensively for sequential regression designs as it allows an intuitive approach to borrow information 
spatially across samples to build global estimates of the entire response surface Two further 
advantages are the analytic structure of Gaussian processes that allows for analytic evaluation 
of many Expected Improvement criteria, and the ability to naturally transition between model¬ 
ing of deterministic (noise-free) experiments where data needs to be interpolated, and stochastic 
simulators where data smoothing is additionally required. 

More generally, we suggest a Bayesian perspective to global ranking, viewing the response 
surfaces as realizations of a random variable taking values in a given function space. This offers 
a tractable quantification of posterior metamodel uncertainty and related sequential metrics for 
determining the minimum surface. Thus, we stress that kriging is not essential to implementation 
of our algorithms; for example competitive alternatives are available among tree-based models, 
such as dynamic trees [25] and Bayesian trees [12]. Moreover, while classical kriging may not be 
flexible enough for some challenging problems, there are now several well-developed generalizations, 
including treed GPs [24], local GPs [20], and particle-based GPs [23], all offering off-the-shelf use 
through public R packages. 

Eollowing the Efficient Global Optimization approach [29], we define expected improvement 
scores that blend together the local complexity of the ranking problem and the posterior variance 
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of our estimates. In particular, we rely on the expected reduction in posterior variance and borrow 
from the Stepwise Uncertainty Reduction criteria based on GP regression from [41, 10]. We also 
investigate UCB-type heuristics [4] to trade-off exploration and exploitation objectives. Based on 
the above ideas, we obtain a number of fully sequential procedures that specifically target efficient 
learning of C(-) over the entire design space X. Extensive numerical experiments are conducted to 
compare these proposals and identify the most promising solutions. 

As explained, our algorithms are driven by the exploration-exploitation paradigm quantified in 
terms of (empirically estimated) local ranking complexity for C{x) and confidence in the estimated 
C. To quantify the local ranking complexity, we use the gaps A(x) [17, 9, 28]. For any x G X, 
denote by /r(i)(x) < pi( 2 ){x) < ... < pL(^^{x) the ranked responses at x and by 

A(x) := li(i)(x) -/r(2)(x) 

the gap between the best (smallest) and second-best response. A(x) measures the difficulty in 
ascertaining C(x): for locations where — pip 2 ) is big, we do not need high fidelity, since the 
respective minimal response surface is easy to identify; conversely for locations where /i(i) — pL( 2 ) is 
small we need more precision. Accordingly, we wish to preferentially sample where A(x) is small. 
This is operationalized by basing the experimental design decisions on the estimated gaps A(x). 

In terms of design over T, exploration suggests to spend the budget on learning the responses 
offering the biggest information gain. Namely, substantial benefits are available by discriminating 
over the sampling indices d. through locally concentrating on the (two) most promising surfaces 
pi{i),p.( 2 )- This strategy is much more efficient than the naive equal sampling of each Y^. In addition, 
since the noise level in Yi may vary with t this must also be taken into account. Summarizing, our 
expected improvement metrics blend empirical gaps A and empirical posterior uncertainty based 
on kriging variance 6e{x), jointly discriminating across A x £. 

Our contributions can be traced along three directions. First, we introduce and analyze a novel 
sequential design problem targeting the loss function (1.3). This setting is motivated by dynamic 
programming algorithms where statistical response models have been widely applied since the 
late 1990s [15, 33]. Here we contribute to this literature by proposing a Bayesian sequential design 
framework that generates substantial computational savings. This aspect becomes especially crucial 
in complex models where simulation is expensive and forms the main computational bottleneck. 
Second, we generalize the existing literature on Bayesian optimization and contour-finding to the 
multi-surface setting, which necessitates constructing new El measures that address joint design in 
space and index dimensions. We demonstrate that this allows for a double efficiency gain: both in 
X and in C. Third, we extend the multiple bandits problem of [17] to the case of a continuum of 
bandits, which requires building a full meta-model for the respective arm payoffs. Our construction 
offers an alternative to the recent work [8] on A-armed bandits and opens new vistas regarding 
links between MAB and DoF. 

Our approach also generalizes Gramacy and Ludkovski [22]. The latter work proposed sequen¬ 
tial design for the contour-finding case where the design is solely over the input space A. In that 
context [22] introduced several FI heuristics and suggested the use of dynamic trees for the re¬ 
sponse modeling. The framework herein however requires a rather different approach, in particular 
we emphasize the bandit-inspired tools (such as UCB) that arise with simultaneous modeling of 
multiple response surfaces. 

The rest of the paper is organized as follows. Section 2 describes the kriging response surface 
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methodology that we employ, as well as some analytic formulas helpful in the context of ranking. 
Section 3 then develops the expected improvement heuristics for (1.1). Sections 4 and 5 illustrate 
the designed algorithms using synthetic data (where ground truth is known), and a case-study from 
epidemic management, respectively. Finally, Section 6 concludes. 

1.3. Connection to Stochastic Control. Consider the objective of minimizing total costs as¬ 
sociated with a controlled state process X, 

T 

(1.5) c(0; uo:t) = ^ g{t, Xt, ut) 

t=o 

on the horizon {0,1,..., T}. Above g{t, x, u) encodes the stagewise running costs, uq-t is the control 
strategy taking values in the finite action space ut G £, and Xt = is a stochastic discrete-time 
Markov state process with state space A C The dynamics of are of the form 


X^+, = F{Xt,ut,^t+i) 

for some map F rAxTxM—>-A’, where is a random independent centered noise source. The 
performance criterion optimizes expected rewards, captured in the value function F(0,x), 


V{t,x):= inf K[c{t;ut:T)\Xt = x], t G {0,1,..., T}, x G A, 

Uf.T&A 

over all admissible closed-loop Markov strategies ut-.T G U. Thus, at time t, the action ut = 
u{t,Xt) is specified in feedback form as a function of current state Xt. The policy map (t,x) i—>■ 
u*{t,x) translates system states into actions and is related to the value function via the dynamic 
programming equation (DPE): 

(1.6) V{t,x) = min{g'(t,x,n) + Et [V{t + l,Xl^^)] (x)} = ^„*(x;t), 

(1.7) with gu{x-,t) := g{t,x,u)+Et[V{t + l,X^^.t)]{x). 


The notation Et[-](x) = E[-|Xt = x] is meant to emphasize averaging of the stochastic future at 
t + 1 based on time-t information summarized by the system state Xt = x. The term g,u{x',t) is 
the Q-value, providing the expected cost-to-go if one applies action u G C aX Xt = x. 

Solving the DPE is equivalent to computing the Q-values since by (1.6), V{t, x) = min£g£{/i£(x; t)}. 
The ranking problem in (1.1) is then known as the policy map x i—?• u*{t, x) that partitions the state 
space X into L action sets Ci{t). Given m*(s, •) for all s = f -|- 1,..., T and all x G (initialized via 
V{T,x) = g{T,x)), we observe that 


( 1 . 8 ) 


liu{x-,t) = g{t,x,u) + Et 


- rp 

9{n,X^,Un) 

_n=t+l 


(x), 


where {ut) is a strategy that uses action u at t and m*(s, A^) thereafter, s > t. Indeed, the sum in 

(1.8) is precisely the random variable for the pathwise costs-to-go c{t,ut-T)- The loss (1.3) is then 
the difference between acting optimally as u*{t,Xt) at t, vis-a-vis taking action £ (and then acting 
optimally for the rest of the future, {f -|- 1,..., T}), weighted by the distribution F{ dx) of Xt. 

The formulation (1.8) allows pursuit of policy search methods by tying the accuracy in (1.7) 
not to the immediate fidelity of (estimated) Q-values //«(•; t), but to the quality of the policy map 
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u*{t, x). Namely, one iteratively computes approximate policy maps u{s, •) for s = T — 1, T — 2,.. 
using (1.8) to construct u{t, •) based on {^(s, ■) : s > t}. Note that the original objective of finding 
F(0,x) requires solving T ranking problems of the form (1.1). 

This approach to dynamic programming is especially attractive when the action space £ is very 
small. A canonical example are optimal stopping problems where £ = {stop, continue}, i.e. L = 2. 
For a single stopping decision, the immediate reward /X 2 (x; t) is typically given, leading to the case of 
estimating a single Q-value /ii(x; t), see [22]. Multiple stopping problems where both and ^2 need 
to be estimated arise in the pricing of swing options [38], valuation of real options [1], and optimizing 
of entry-exit trading strategies [48]. The case L > 2 was considered for valuation of energy assets, 
especially gas storage [31], that lead to optimal switching problems. For example, storage decisions 
are usually modeled in terms of the triple alternative L = 3 of {inject, do — nothing, withdraw}. 
Small action spaces also arise in many engineering settings, such as target tracking [2, 27], and 
sensor management [16]. 

2. Statistical Model. 

2.1. Sequential Design. Fix a configuration {fj.£,i = and corresponding classifier 

C(-). A design of size K is a collection := x € X,£ € Z, with superscripts denoting 

vectors. Fixing Z^^\ and conditioning on the corresponding samples = {y£k{x^)){^^i, let 

q{k) ^ C{Y^'^, Z^^^) be an estimate of C. We aim to minimize the expected loss C{C{-, Z^^'^),C) 
over all designs of size K, i.e. 


( 2 . 1 ) 


inf E 
z-.\z\=K 


C{C{Y^--^,Z),C) 


where the expectation is over the sampled responses Y^'^. To tackle (2.1) we utilize sequential 
algorithms that iteratively augment the designs Z as T-samples are collected. The interim designs 
z(^) are accordingly indexed by their size k, where k = Ko, Kq + 1, ..., K . At each step, a new 
location is added and the estimate is recomputed based on the newly obtained 

information. The overall procedure is summarized by the following pseudo-code: 

1. Initialize Z^^°^ and 

2. LOOP for k = Ko,... 

(a) Select a new location and sample corresponding := Y(^k+i{x^~^^) 

(b) Augment the design = Z^^^ U 

(c) Update the classifier 2^(^+i)) by assimilating the new observa¬ 

tion 

3. END Loop 

The basic greedy sampling algorithm adds locations with the aim of minimizing the myopic 
expected estimation error. More precisely, at step k, given design Z^^'i (and corresponding Y^'^), 
the next pair is chosen by 


( 2 . 2 ) 


arg inf( 2 :fc+b^J=+i)gA:x£ 


E 


£(^^(yl:(fc+l)^^(fc+l))^^) 


where the expectation is over the next sample Y^k+i{x^^^). This leads to a simpler one-step-ahead 
optimization compared to the iL-dimensional (and typically we are looking at AT 3> 100) formulation 
in (2.1). Unfortunately, the optimization in (2.2) is still generally intractable because it requires 
• re-computing the full loss function C{-,C) at each step; 







Sequential Design for Ranking Response Surfaces 


7 


• finding the expected change in C given 

• integrating over the (usually unknown) distribution of 

• optimizing over the full d + 1-dimensional design space Y x 2. 

We accordingly propose efficient numerical approximations to (2.2), relying on the twin ideas of 
(i) sequential statistical modeling (i.e. computing and updating C as Z grows), and (ii) stochastic 
optimization (i.e. identifying promising new design sites {x,i)). 

2.2. Response Surface Modeling. A key aspect of sequential design is adaptive assessment 
of approximation quality in order to maximize information gain from new samples. Consequently, 
measuring predictive uncertainty is central to picking For that purpose, we use a 

Bayesian paradigm, treating fii as random objects. Hence, we work with a function space A4 and 
assume that m € Ai with some prior distribution Tq. Thus, for each x, (x) is a random variable 
whose posterior distribution is updated based on the collected information from samples (x, i, y£{x)). 
Given the information generated by the A:*^-step design Z^^\ Ta, = ct {Y£{x) : (x,£) we de¬ 

fine the posterior M^^\x) ~ y£{x)\Fk- The random variable Mf^\x) is the belief about /^^(x) 
conditional on Tk', its first two moments are referred to as the kriging mean and variance respec¬ 
tively, 

(2.3) ^if{x)■.= n^^£{x)\Tk], 

(2.4) 5f\xf := n^x) - rif\x)f\Fk]. 

We will use /2(x) as a point estimate of yt{x), and (j^(x) as a basic measure of respective uncertainty. 
The overall global map x t—;• M^^\x) is called the kriging surface. Note that while there is a 
spatial correlation structure over X, we assume that observations are independent across T (so 
sample noise ee _LL yi), so that the posteriors M^^\x), i = 1,2, ... are independent. 

The order statistics /I(i)(x) < /i( 2 )(x) < ... describe the sorted posterior means at a fixed x. A 
natural definition is to announce the minimum estimated surface 

(2.5) C(x) := argmm{/i^(x)} , 

i.e. the estimated classifier C corresponds to the smallest posterior mean, so that = ju(i)(x). 

On the other hand, the uncertainty about C(x) can be summarized through the expected minimum 
of the posteriors Mi, M 2 ,..., M^, 

( 2 . 6 ) m^^\x) = E[mm{yi{x),..., yLix))\Fk]. 

Observe that E[min£ £ii{x)\Fk] = m^^\x) < = min£E[^£(x)|Tfc], and we accordingly define the 

M-gap (“M” for minimum) 

(2.7) Af(x) :=/i(i)(x) — m(x) > 0. 

The M-gap measures the difference between expectation of the minimum and the minimum expected 
response, which precisely corresponds to the Bayesian expected loss at x in (1.3). This fact offers 
an empirical analogue £C{C) of the original loss function C{C,C) in (1.3), 

(2.8) £C{C):= [ M(x)F(dx). 

Jx 

The above formula translates the local accuracy of the kriging surface into a global measure of 
fidelity of the resulting classifier C and will be the main performance measure for our algorithms. 



8 


Ruimeng Hu and Michael Ludkovski 


2.3. Kriging. The response surfaces are assumed to be smooth in X. As a result, information 
about fie{x') is also revealing about //^(x) for x ^ x\ coupling observations at different sites. To 
enforce such conditions without a parametric representation, we view each m as a sample from a 
Gaussian process (GP). A GP is specified by its trend or mean function ti{x) = E[/r£(x)] and a 
covariance structure K,^ : —>■ M, with K,^{x^ x') = E[(/i£(x) — t^{x)){^i{x') — ti{x'))]. By specifying 

the correlation behavior, the kernel 1C encodes the smoothness of the response surface. 

Fix the response surface index I and let y = (y(x^),... ,y(x"'))^ denote the observed samples 
at locations x = These realizations are modeled as in (1.2) with the response represented as 

yi{x) = tf{x) + Ze{x), 


where te{-) is a hxed trend term and Z£(-) is a realization of a Gaussian process. Given the samples 


(x, the posterior of ye again forms a GP; in other words any collection M^^\x'i ),..., 


riri). 


is multivariate Gaussian with mean (x'), covariance {x[, x' ), and variance <5^*^ (x()^, specihed 
by [47, Sec. 2.7] (see also [3]): 


(2.9) 

( 2 . 10 ) 

with 


u^"^(x',x'-) = ICi{xi,x'j) - k^/^\x'i)'^{Ki + T,^P^)~'^k^"'\x'j) 




vf\x'i,x'i) 4"^ = (t£(x^),... ,t£(x”))^, and 

(/C£(x\x'),... ,/Cf(x'^, x'))^, := diag(o-|(x^),..., ct^^(x”)). 


and K£ is the n x n positive dehnite matrix (K£)jj := ICi{x^,x^), 1 < i, j < n. By independence 
across i, the vector of posteriors M{x) at a fixed x satishes 

M(x) ~ AA(^(x), A(x)) with /x(x) = [/2i(x),..., ^^(x)]^, A(x) = diag (5?(x),..., 5|,(x)) . 

A common choice is the Matern-5/2 kernel 

(2.11) /C(x, x'; s, 0) = s^(l + (a/S + 5/3)||x — x'Hq) • II®, lk||e = \/x diagOx’^. 

The length-scale parameter vector 6 controls the smoothness of members of A4 a:, the smaller the 
rougher. The variance scalar parameter determines the amplitude of fluctuations in the response. 

A major advantage of kriging for sequential design are updating formulas that allow to efficiently 
assimilate new data points into an existing fit. Namely, if a new sample (x, is added to an 

existing design x^'^, the mean and kriging variance at location x are updated via 

(2.12) y(^+i)(x) = y('^)(x) + A(x,x^+^xi^^)(/+i -y(^)(x^+i)); 

(2.13) d(^+i)(x)2 = 5 W(x)2 - A(x,x^+^x^^^)2[cj2(x(^+^)) - y(^)(x^+^)]. 


where A(x, x^'*^) is a weight function specifying the influence of the new sample at on 
X (conditioned on existing design locations x^'^). In particular, the local reduction in posterior 
standard deviation at x^^^ is proportional to the current [11]: 

<5W(x^+i) “ yppriy^pyp+Tp' 


( 2 . 14 ) 
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Note that the updated posterior variance {x)"^ is a deterministic function of which is 

independent of 

For our examples below, we have used the DiceKriging R package [45] to compute (2.9). The 
software takes as input the location-index pairs the corresponding samples yi{xY'''^ and 

the noise levels a‘f„{x"'), as well as the kernel family (Matern-5/2 (2.11) by default) and trend basis 
functions f^ix) and runs an EM MLE algorithm to estimate the hyper-parameters s,0 describing 
the kriging kernel JCi- 

2.4. Summary Statistics for Ranking. Given a fitted kriging surface M£{-) (for notational 
convenience in this section we omit the indexing by the design size k), the respective classifier C is 
obtained as in (2.5). Note that C{x) is not necessarily the MAP (maximum a posteriori probability) 
estimator, since the ordering of the posterior probabilities and posterior means need not match for 
L > 2. Two further quantities are of importance for studying the accuracy of C: gaps and posterior 
probabilities. Eirst, the gaps quantify the differences between the posterior means, namely 

(2.15) ^iix) := \ye{x) - mmfij{x)\, 

(2.16) A(x) := |/r(i)(x)-/2(2 )(x)|, 

where < /i( 2 ) < ■ • • < /r(L) a-re the ordered posterior means. Note that under L = 2, we have 
Ai(-) = A 2 (-) = A(-) due to symmetry. Second, define the posterior probabilities for the minimal 
rank 


(2.17) pe{x) := P {ixi{x) = /r(i)(x)|Tfc) = F{Mi{x) = m.mMj{x)). 

We refer to P(i)(x) > P[ 2 ){x) > ... > P(l){x) as the decreasing ordered values of the vector 
p{x) := {Pi{x)}]f^i, so that the index of P(i)(x) is the MAP estimate of the minimal response 
surface. The following proposition provides a semi-analytic recursive formula to evaluate p{x) in 
terms of the kriging means and variances {pi{x),5‘j{x)). 

Proposition 2.1 (Azimi et al. [5]). IfM{x) ~ A7(/x(x), A(x)), then for any i G £, 


(2.18) 


pe{x) = P 



min Mj{x) 




where <k(-) is standard normal cdf, and = [ri,r 2 ,... ,rL-i]^ = {A{£)A.{x)A{£)'^) ^/^A(7)/7(x), 
with A{£) a {L — 1) X L matrix defined via 




' 1 if j = ^, 

< —1 if 1 < i = j < £, or £<i + l = j<L, 
, 0 otherwise. 


Corollary 2.2. For L = 2, we havepi{x) = P(Mi(x) < M 2 (x)) = $ ( |, andp 2 (x) = 

1 -pi(x). 

The next proposition provides another semi-analytic formula to evaluate m(x) defined in (2.6). 
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Proposition 2.3. Suppose that L = 2 and let M^{x) ~ i = 1,2 he two indepen¬ 

dent Gaussians. Define 

di 2 ■■= \ldl{x) + 5|(x), and ai 2 := (^i(x) - 'fi 2 {x))/di 2 . 

Then the first two moments of M(^i){x) = min(Mi(x), M 2 (x)) are given by: 

(2.19) m{x) = E[M(i)(x)] = fii{x)^{-ai 2 ) + fi 2 {x)^{ai 2 ) - di 2 </>(ai 2 ), 

(2.20) E [M^i){xf] = ifilix) + dl{x))^{-ai 2 ) + {fil{x) + J|(x))^>(ai 2 ) 

- (/ii(x) + fi2{x))di2(l){ai2). 

Equation (2.19) provides a closed-form expression to evaluate m{x) = E[M(i)(x)] for L = 2. In 
the case L > 2, one may evaluate m{x) recursively using a Gaussian approximation. For instance, 
for L = 3, approximate Y := Mi{x) A M 2 {x) by a Gaussian random variable with mean/variance 
specihed by (2.19)-(2.20) respectively (i.e. using ai 2 and ^ 12 ) and then apply Proposition 2.3 once 
more to M(i)(x) =Y A M^{x). 

3. Expected Improvement. The Bayesian approach to sequential design is based on greedily 
optimizing an acquisition function. The optimization is quantified through Expected Improvement 
(El) scores that identify pairs {x, €) that are most promising in terms of lowering the global empirical 
loss function EC according to (2.2). In our context the El scores are based on the posterior 
distributions which summarize information learned so far about fii{x). 

Our two main heuristics are dubbed Gap-UGB and Gap-SUR: 

(3.1) := -Ai{x) +jkde{x); 

(3.2) :=E[M^^\x)-M^’^+^'>{x)\x^+^ = x,i^^^ =£,Fk]- 

The Gap-UGB score is motivated by the exploration-exploitation trade-off in MAB’s and favors 
locations with small gaps in posterior means and high kriging variance. Indeed, the local empirical 
gap measure [17] A^(x) identifies the most promising arm, while the kriging variance 5‘l{x) promotes 
exploration to reduce uncertainty about arm payoffs. The two are connected via the UGB (upper 
confidence bound [46]) tuning parameter 7 ^ that balances exploration (regions with high 5i{x)) 
and exploitation (regions with small gap). Another interpretation of Gap-UGB is to mimic a 
complexity-sampling scheme that selects design sites based on the complexity of the underlying 
ranking problem. Indeed, the gap Afix) := pifix) — pLj{x) measures the hardness of testing 

whether iJ,i{x) = min^ fii{x)', the smaller Afix) the tougher. At the same time, the kriging variance 
5‘^{x) can be related to information gain from sampling at x (being akin to the standard error of a 
point estimator). 

The Gap-SUR strategy is coming from the perspective of simulation optimization. Recall 
that we strive to lower the empirical loss EC in (2.8) which is related to the M-gap in (3.2), 
EC = f M(x)E(dx). Accordingly, the Gap-SUR criterion uses A4(x} to guide the adaptive design, 
by aiming to maximizing its expected local reduction if we add {x,£) to the design. Such Stepwise 
uncertainty reduction (SUR) strategies were introduced in [ 6 , 10]. The evaluation of (3.2) requires 
computing the expected mean and variance of M(^)(x) and Mfix). The updating formula (2.12) 
implies that (keeping JC fixed) {x)\x’^^^ = = £,Ek\ = Tiix), while (2.14) yields 

j^A:+i)(^)^ The rest of the computation becomes straightforward in view of Proposition 2.3. 
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Remark 3.1. Gap-SUR is also connected to the Active Learning Cohn (ALC) [I 4 ] approach to 
DoE. In ALC, minimization of posterior variance is achieved by greedily maximizing reduction 
in 5‘^{x). In Gap-SUR, minimization of SC is achieved by maximizing reduction in M.[x). The 
ALC paradigm suggests an alternative to (3.1), namely {x,l) = —lS.^{x) -\-'yk[df^\x) — 

that blends expected decline in kriging variance with the estimated gap. 

Asymptotic Behavior. The Gap-SUR method aims to drive the M-gaps to zero, which is 
equivalent to learning all the responses: M.{x) = 0 6 i{x) = 0 V^, see (3.2). For GP models, 

vanishing posterior variance at x corresponds to the design being dense in the neighborhood of x. 
Thus, asymptotically, the Gap-SUR heuristic will generate designs that are dense across A x £. 
Finally, previous results about consistency of GP models (see for example [13]) can be invoked to 
establish that C —)• C. 

On the other hand, proper selection of the UGB schedule ( 7 ^) is crucial for the performance 
of Gap-UCB. If = 0 then convergence is not guaranteed. Indeed, consider xi,X 2 such that 
A(x 2 ) > A(xi), but the estimated gaps based on interim satisfy A(xi) > A(x 2 ) > A(x 2 ) due 
to estimation error at xi. Then at stage k the algorithm will prefer site X 2 over xi (since it has 
smaller gap A) and will then possibly get trapped indefinitely, never realizing that the estimated 
ordering between A(xi) and A(x 2 ) is wrong. Hence without UGB the algorithm is prone to get 
trapped at local minima of A. At the same time, any increasing unbounded 7 ^ —)> -|-oo guarantees 
that supj, (5^^^ (x) —)> 0 Ml. Toward this end, Srinivas et al. [46] proved that in a cumulative regret 
setting 7 fc = 0{yJ\og k) should grow logarithmically in sample size k. Further rules on how to 
choose 7 fc (for the case of a finite state space A) can be found in [17]. Another alternative is a 
randomized version. For example, in e-greedy sampling, with probability e at any step instead of 
using an El metric, (x,£)^^^ are selected uniformly in A x £. This ensures that the designs 
are dense in A as /c —?• 00 and is a feature that we resort to in our experiments. Still, fine-tuning 
the schedule of A: 1 —?• 7 ^ is highly non-trivial in black-box settings. For this reason, usage of the 
Gap-UGB approach is sensitive to implementation choices and further guidance on selecting ( 7 ^) 
is left for future research. 

3.1. Selecting the Next Sample Location. To grow the designs Z^^'> over k = Kq,Kq -|- 1,... 
we use the El scores via the greedy sampling strategy 

(3.3) {x,l)^^^ = argsup Ek{x,l). 

ix,e)exxs. 

Because the above introduces a whole new optimization sub-problem, in cases where this is compu¬ 
tationally undesirable we instead replace arg sup 2 ,g_:^^ with arg max^^gT- where T is a finite candidate 
set. Optimization over T is then done by direct inspection. The justification for this procedure is 
that (i) we expect Ek{x,t} to be smooth in x and moreover relatively flat around x*; (ii) Ek[x,l) 
is already an approximation so that it is not required to optimize it precisely; (hi) performance of 
optimal design should be insensitive to small perturbations of the sampling locations. To construct 
such candidate sets T in A, we employ Latin hypercube sampling (LHS) [37]. LHS candidates 
ensure that new locations are representative, and well spaced out over A. See [21, Sec 3.4] for 
some discussion on how T should be designed. In addition, we refresh our candidate set T at each 
iteration, to enable “jittering”. Algorithm 1 below presents the resulting method in pseudo-code. 

Remark 3.2. In the context of a kriging model, the initial design Z^^°^ is crucial to allow the 
algorithm to learn the covariance structures of the responses. One common challenge is to avoid 




12 


Ruimeng Hu and Michael Ludkovski 


Algorithm 1 Sequential Design for Global Ranking using Kriging 
Require: Ko,K 

1 : Generate initial design ^ := using LHS 

2 : Sample estimate the GP kernels /C^’s and initialize the response surface models 

3: Construct the classifier using (2.5) 

4: /c •(— Kq 

5: while k < K do 

6 : Generate a new candidate set of size D 

7: Compute the expected improvement (El) Ek(x,£) for each x £T, i £ Z 

8 : Pick a new location = argmax Ek{x,i) and sample the corresponding 

(x,£)er('=)x£ 

9: (Optional) Re-estimate the kriging kernel fC^k+i 

10 : Update the response surface M^k+i using (2.12)-(2.13) 

11 : Update the classifier using (2.5) 

12 : Save the overall grid £- Z^^'> U 

13: k ^ k+1 

14: end while 

15: return Estimated classifier C^^^(-). 


assuming that yi’s are too flat by missing the shorter-seale fluctuations [42]. Thus, Kq must be 
large enough to reasonably estimate fCi; one recommendation is that Kq should be about 20% of the 
eventual design size K. In our implementation, the initialization is done via a spaee-filling LHS 
design (sampling equally across the L surfaces). Another issue is the re-estimation of the kriging 
kernel KLi in step 9 of Algorithm 1. Re-training is eomputationally expensive and makes the GP 
framework not sequential. Since we expect the algorithm to converge as k ^ oo, we adopt the 
practical rule of running the full estimation procedure for 1C according to the doubling method [18], 
re-estimating K.^ for A; = 2,4, 8 ,... a power of two, and keeping it frozen otherwise. 

3.1.1. Hierarchical and Concurrent Sampling. Instead of sampling directly over the pairs 
{x,£) £ X X Z, one can consider two-step procedures that first pick x and then I (or vice-versa). 
This strategy matches standard sequential designs over X. Indeed, one can then directly follow the 
active learning approach of [36, 14] by first picking using the gap metrics, and then picking 
the index based on the kriging variance: 

x^~^^ = argmin A(x)|Tfc, cf. (2.16) 

(3.4) ... 

= argmaxdl 

Conditional on picking x^^^, the above choice selects surfaces with large kriging variance dflx), 
attempting to equalize 5flx) across i. Note that (3.4) will focus on the most uncertain response, 
not on the most promising one, which tends to hurt overall performance when L 2. Another 
choice is to pick to greedily maximize the information gain as in (2.14). Such two-step El 
heuristics allow to avoid having to specify the schedule 7 *, of UCB criteria (3.1). 

A further variant is concurrent marginal modeling of each /if(-). This is achieved by concurrent 
sampling: after choosing a location x^^^ = x, one augments the design with the L respective pairs 
(x, 1), (x, 2),... (x, L). This approach “parallelizes” the learning of all response surfaces while still 
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building an adaptive design over X. The disadvantage of this strategy becomes clear in the extreme 
situation when the variance of hi(x) is zero, ai{x) = 0 while the noise ofY 2 {x) is large. In that case, 
after sampling a given location once for each response, (x, 1) and (x, 2), we would have (Ii(x) = 0, 
(^ 2 ( 3 ^) 0. Hence, another sample from li(x) would gain no information at all, while substantial 

information would still be gleaned from sampling Y 2 {x), making parallel sampling twice as costly 
as needed. 


4. Simulated Experiments. 

4.1. Toy Example. In this section we consider a simple one-dimensional example with synthetic 
data which allows a fully controlled setting. Let L = 2, X = [0,1]. The noisy responses yi(x) and 
Y 2 {x) are specified by (cf. the example in [45, Sec 4.4]) 


Yi(x) = /ii(x) -t- ei(x) = - f ^ + 2x^ cos(5x) + 0.841 

8 \ 1 + X 

Y 2 {x) = ^J. 2 {x) + 62 (x) = 0.5 -I- a 2 {x)Z 2 . 


+ (Tl(x)Zi, 


Here are independent standard Gaussian, and the noise strengths are fixed at cri(x) = 0.2 and 
cr 2 (x) = 0.1, homoscedastic in x but heterogenous in £ = 1, 2. The weights F{ dx) = dx in the loss 
function are uniform on X. The true ranking classifier C(x) is given by 


(4.1) 


C(x) 


2 for X G [0, ri] U [r 2 ,1] 
1 for ri < X < r 2 . 


where ri « 0.3193, r 2 ~ 0.9279. 



K = 100 


K = 400 


Figure 1. Response surface modeling with the Gap-SUR El criterion of (3.2). We plot the true surfaces p,t(x) 
(black dashed lines), the posterior means (blue/red solid lines), the 90% posterior credibility intervals (light 

blue/red areas) of Mi{x) and M 2 {x), and the sampling locations x^'^ for Yi{x) (blue triangles) and Y 2 {x) (red 
circles). The middle panel shows the local loss J\4{x), cf. (2.7), while the bottom panel shows the Gap-SUR El metric 
Ek{x,£) (blue: £ = 1, red: £ = 2). 

To focus on the performance of various acquisition functions, we fix the kriging kernels ICi to be 
of the Matern-5/2 type (2.11) with hyperparameters si = 0.1, 9i = 0.18 for /Ci and S 2 = 0.1 ,62 = ^ 
for /C 2 . These hyper-parameters are close to those obtained by training a kriging model for Y£{x) 
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given a dense design on X and hence capture well the smoothness of the response surfaces above. 
We use a fixed trend ti{x) = 0.5, and treat the given sampling noises as known. 

To apply Algorithm 1 we then initialize with Kq = 10 locations (five each from Yi(x) 

and Y 2 {x)), drawn from a LHS design on [0,1]. Note that because the kriging kernels are assumed 
to be known, Kq is taken to be very small. To grow the designs we employ the Gap-SUR El 
criterion and optimize for the next {x,£)^~^^ using a fresh candidate set based on a LHS design 
of size D = 100. Figure 1 illustrates the evolution of the posterior response surface models. The 
two panels show the estimated M^^\x) at AT = 100 and K = 400 (namely we plot the posterior 
means and the corresponding 90% Cl ± 1.645(5^^^ (x)). We observe that most of 

the samples are heavily concentrated around the two classification boundaries ri,r 2 , as well as 
the “false” boundary at x = 0. As a result, the kriging variance (^|(x) is much lower in those 
neighborhoods, generating the distinctive “sausage” shape for the posterior credibility intervals of 
M(_{x). In contrast, in regions where the gap A(x) is large (e.g., around x = 0.5), ranking the 
responses is easy so that almost no samples are taken and the kriging variance remains large. Also, 
because (Ti(x) > ct 2 (x), the credibility intervals of /i 2 are tighter, 5i(x) > 52{x), and more than 70% 
of the samples are from the first response Yi. Indeed, we find Di{k) ~ 3 D2{k) where 

Di{K) :=\{l<k<K : i’' = i}\ 

is the number of samples in the design from the i-th surface. The above observations confirm 
the double efficiency from making the El scores depend on both the X and £ dimensions. 

From a different angle. Figure 2 shows the resulting design in this example and the 

location of sampled sites x^ as a function of sampling order k = 1,...,400. We observe that 
the algorithm first engages in exploration and then settles into a more targeted mode, alternating 
between sampling around 0, ri and r 2 . 




Figure 2. Left: the design based on the Gap-SUR El criterion of (3.2). There were Di(400) = 294 and 

1)2(400) = 106 samples from Yi and Y 2 respectively. Right: sampled locations as a function of k (blue for = 1, 
red for =2/ 

4.2. Comparison and Discussion of El Criteria. As a first basis for comparison, we provide 
three non-adaptive designs. The simplest alternative is the uniform sampling method that relies 
purely on the law of large numbers to learn R(,{x). Thus, at each step k, we generate a new sampling 
location {x,£)^ uniformly from X x C. This generates a roughly equal number of samples Di[k) ~ 
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D 2 {k) from each response and a kriging variance 5|(x) that is approximately constant in x. Clearly, 
this approach yields an upper bound on the possible (empirical) loss. The second alternative is 
separate non-sequential modeling of each through a space-filling design (implemented via LHS); 
this improves on uniform sampling but does not attempt in any way to discriminate in the index 
dimension £. For this example, we take Di = 160 = 4 II 2 to be proportional to the observation 
noise of each surface. (Note that this strategy is roughly equivalent to building a global sequential 
maximin design using the acquisition function Ek{x,t) := 5i{x).) 

The third alternative is to build a sampling scheme that relies on the true With this 

foresight, we generate a design that relies on the actual complexity for resolving C{x) by plugging- 
in the true Ai{x) into the Gap-UCB metric in (3.1). Because sampling depends solely on A£(x) 
and the kriging variances 6^^^\x)‘^ are iteratively determined by the previous x^'^, cf. (2.10), the 
overall design x^'^ is deterministic (hence non-adaptive, but still implemented sequentially). Note 
that the resulting /i£(-)’s and hence outputted C(-) are still a function of 

Several further alternatives for evaluating expected improvement can be designed based on 
classification frameworks. For classification, the main posterior statistic is the probabilities P£(x) of 
fii{x) being the smallest response. One can then use the vector p{x) to measure the complexity of 
the resulting local classification at x. Note that such measures intrinsically aggregate across i and 
hence only depend on x. This suggests either using a two-step sampling procedure as in Section 
3.1.1 or building a UCB-like criterion as in (3.1). We employ the latter method, blending a criterion 
r(x) that discriminates among x-locations (larger scores are preferred) with UCB, leading to El 
scores of the form Ek{x,i) = r(^^(x) +'ykSi{x). Three different choices for F(-) are: 

(4.2) F®^'^(x) := - '^Piix) logpeix); 

e 

(4.3) := - \pBest{x) - PSBix)] ; 

(4.4) F^-*(x) := -pBestix), 

where PBest{x) := P (c{x) = C{x)\Ek^ = Pci^x) posterior probability that the lowest posterior 

mean is indeed the smallest response, and psB is the probability that the second-lowest posterior 
mean is the smallest response. 

The F®'^^ metric is the posterior entropy which is a standard measure of classification com¬ 
plexity. High entropy indicates more spread in p{x) and hence more uncertainty about which is the 
smallest component of /2(x). However, a well-known drawback of entropy is that for large L (bigger 
than 3) the responses that are very unlikely to be the minimum (i.e. small Pi{x)) still strongly 
affect the overall F^'^^(x), leading to non-intuitive shapes of the El scores. The Best-versus- 
Second-Best (BvSB) approach F^’^'^^(x) originating in [30], counteracts this effect by comparing 
just the two lowest posterior means. Small differences between pBest and psB indicate large uncer¬ 
tainty in identifying the minimum response. The BvSB metric can break down however if posterior 
variances d^^xYs are highly unequal, whereby the ordering between pi and pYs is not the same. 
Otherwise, is quite similar to the gap measure A(x). Lastly, focuses on the locations 

where PBest{x) <C 1, i.e. those close to classification boundaries of C(x). When L = 2, F^®^^ and 
YBvSB _ 2 _ 2pBest{x) give the same preferences. 

Note that because F does not discriminate among the surfaces, it is sensible to take 7 ?; = 7 a:(^) 
to be response-specific. Alternatively, the F metrics lend themselves to concurrent sampling which 
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builds an adaptive sequential design in X but treats all surfaces equally: 

(4.5) (, 1 :) = r<*>(x) + 74E 

e 

Yet another alternative is a so-called pure M-Gap heuristic that uses (3.2) via 

(4.6) = arg max M{x), = argmaxfj^(x^"*"^). 

t- 

This hierarchical sampling strategy can be viewed as generalizing the Efficient Global Optimization 
(EGO) criterion of [29] to the ranking problem, cf. the classification variant of EGO in [24]. 


Table 1 

True loss vs. empirical loss with for the 1-D example. For UCB heuristics the cooling schedule is of the 

form 7 fc = cVlog k with c as listed below. The error probability ErrProb measures the mean of 1 — p^g^gl (x) over the 
test set. D\ = Di(200) is the number of samples out of 200 total from Y\. 


Method 

Emp Loss 

(SE) 

True Loss 

(SE) 

ErrProb 

(SE) 

Di 

Uniform Sampling 

2.89E-3 

(1.24E-4) 

2.64E-3 

(2.67E-4) 

6.87% 

(0.25%) 

100 

Non-adaptive LHS 

2.16E-3 

(l.OlE-4) 

1.91E-3 

(2.12E-4) 

6.05% 

(0.22%) 

160 

Known-Gap-UCB, c = 4 

1.77E-3 

(8.35E-5) 

1.43E-3 

(1.91E-4) 

5.61% 

(0.23%) 

174 

Gap-SUR 

0.96E-3 

(4.98E-5) 

1.19E-3 

(1.84E-4) 

3.82% 

(0.17%) 

146 

Pure M-Gap 

1.20E-3 

(5.39E-5) 

1.81E-3 

(2.33E-4) 

4.28% 

(0.15%) 

172 

Concurrent M-Gap 

1.36E-3 

(8.33E-5) 

1.52E-3 

(1.97E-4) 

4.78% 

(0.24%) 

100 

Gap-UCB, c = 0.1 

2.62E-3 

(1.74E-4) 

2.23E-3 

(2.60E-4) 

5.46% 

(0.23%) 

163 

Gap-UCB, c = 0.25 

2.05E-3 

(1.02E-4) 

1.63E-3 

(2.43E-4) 

5.16% 

(0.19%) 

165 

Gap-UCB, c = 1 

1.27E-3 

(5.61E-5) 

1.50E-3 

(1.98E-4) 

4.39% 

(0.16%) 

167 

Gap-UCB, c = 5 

1.56E-3 

(7.29E-5) 

1.62E-3 

(2.14E-4) 

5.10% 

(0.20%) 

176 

Gap-UCB, c = 10 

1.83E-3 

(7.89E-5) 

1.48E-3 

(1.89E-4) 

5.49% 

(0.20%) 

172 

ri3est_ucB, c = 5 

1.29E-3 

(5.85E-5) 

1.35E-3 

(1.71E-4) 

4.53% 

(0.17%) 

172 

ri57VT_uCB, C = 5 

1.14E-3 

(6.02E-5) 

1.33E-3 

(1.80E-4) 

4.22% 

(0.18%) 

169 

Gap-SUR w/training ICi 

1.20E-3 

(5.87E-5) 

1.69E-3 

(3.24E-4) 

4.34% 

(0.37%) 

146 


4.3. Benchmarks. To judge the efficiency of different sequential designs, we proceed to bench¬ 
mark the performance of different approaches. Table 1 and Figure 3 compare the performance 
of El acquisition functions, including the three non-adaptive methods; Gap-SUR; Gap-UCB with 
different y^-schedules; methods based on posterior probabilities p(-): T^'^^-UCB entropy criterion 
based on (4.2) and r^®^*-UCB criterion based on (4.4); the pure M-gap heuristic (4.6); and con¬ 
current sampling with M-Gap. To construct the summary statistics in Table 1 we initialized each 
algorithm with a random LHS design of size Kq = 10 and augmenting it until K = 200 sites. 
Throughout, we compute both the true loss in this synthetic example where peix) are known, as 
well as the approximated empirical loss £C 

1 ^ 

(4.7) SC{C,C) = — ^{/i(i)(iAx) -m(jAx)} , 

i=i 

where we used M = 1000 = 1/Ax uniformly spaced gridpoints in T = [0,1]. A further metric 
reported is the error probability 1 — P^Belti^) which measures the posterior probability that the 
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identified minimum response is incorrect. Each method was run 100 times to compute the resulting 
mean and standard deviation of the loss function £ and the empirical loss ££. To isolate the 
effect of the El criterion, we continue with a fixed GP covariance structure for the nis and 
pre-specified ci^’s (see hyperparameter values in Sec. 4.1). 

The Gap-SUR algorithm appears to be the most efficient, in particular being much more efficient 
than a naive uniform sampler (or the non-adaptive LHS sampler). It also performs better than 
Gap-UCB or the pure M-Gap methods and moreover also has the smallest fluctuations across 
algorithm runs, indicating more stable behavior. Nevertheless, the UCB methods are nearly as 
good, in particular the entropy-based r^-^^-UCB approach is competitive. However, as discussed 
these methods are sensitive to the choice of the y^-schedule; the table shows that a poorly chosen 
can materially worsen performance. In this example, with 7 = c-y/log k, the scaling c = 1 works 
well, but if c is too small then the method is overly aggressive, and if c is too big the sampling 
is essentially space-filling. At the same time, a limitation of Gap-SUR is that it requires knowing 
the noise variances when optimizing the El acquisition function. Perhaps surprisingly, the 
Known-Gap-UCB strategy loses out to the adaptive methods. This happens because the empirical 
loss of the non-adaptive method is in fact rather sensitive to the observed samples which can 
generate erroneous estimates of nt{x) and mis-classified C{x). Gonsequently the Known-Gap-UCB 
design, while properly placing (x,!')^'^ on average, does not allow self-correction so that erroneous 
beliefs about can persist for a long time, increasing £C. In contrast, adaptive algorithms add 
samples to any regions where observations suggest that A(x) is small, sharpening accuracy there 
and lowering both true and empirical loss functions. 

The left panel of Figure 3 visualizes algorithm behavior as a function of design size k, by 
plotting the approximated empirical loss £C{C^^\C) from (4.7) for four representative strategies. 
All methods appear to enjoy a power-law (linear behavior on the log-log plot) for ££ as a function 
of k, with the slopes of the adaptive method strictly bigger than the non-adaptive ones. 



Figure 3. Left: Averaged empirical loss £C{C^^^) as a function of design size k (in log-log scale). We compare 
our adaptive Gap-SUR (3.2) and Gap-UCB methods (3.1) (with 7*. = 1 • Ulog k) against a uniform sampler and a 
Known-Gap-UCB based on the true gap A(-). Right: boxplot of C{C^^\C) at K = 400 computed via (4.7), across six 
different El approaches. 


Table 1 also highlights the gain from discriminating among the response surfaces, as the Con- 
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current M-Gap algorithm is notably worse (with losses of about 30% higher) relative to Gap-SUR. 
The only difference between these methods is that Gap-SUR sampled Yf 146 times out of 200, while 
the concurrent method was constrained to sample each response exactly 100 times. All approaches 
that optimize over the full A x T focus on fitting the noisier Yi, sampling it 70-85% of the rounds 
(see the Di column). 

As a final comparison, the last row of Table 1 reports the performance of the Gap-SUR method 
in the practical context where one must also train the GP kernels /C^’s by learning 6i, s^, a. All the 
parameters, including the observation noise a which is viewed as the nugget of the GP covariance 
structure, are estimated via MLE. Since training introduces additional noise into the fitted response 
surfaces, algorithm performance is necessarily degraded, especially in terms of variation across 
algorithm runs. This could indicate that the stationary GP model is not ideal here. 

Table 1 also shows that the empirical and actual loss C{C^^\C) metrics are consistent, 

so that the former can be used as an internal online assessment tool to monitor accuracy of the 
estimated classifier. Mismatch between the two measures is driven by model mis-specification, as 
incorrectly inferred covariance structure of /Ui(x) leads to over-optimism: £C < C. This issue is 
largely independent of the sampling scheme and pertains more to the modeling framework than to 
El acquisition functions. 

4.4. Many Surfaces. Our next example treats a more complex setting with L = 5 surfaces and 
a 2 -dimensional input space X = [— 2 , 2 ]^: 



Response 

GP Parameters { 9 i, 62 ,rf ,tf) 

/ri(xi,X2) 

2 — x\ — 0 . 5^2 

(4,6.5,23,-10) 

£2(a;i,X2) 

2 (xi - if + 2x1-2 

(7.5,7.5,475,60) 

t^2.{xi,X2) 

2 sin( 2 xi) -|- 2 

(1,8, 2,1.9) 

t^i{xi,X2) 

8(xi - if + ^xl-2, 

(8,8,8000,300) 

£ 5 ( 3 ^ 1 , X 2 ) 

Q.b{xi + ‘if+ lQxl-Q 

(8,4,2500,150) 


We assume constant homoskedastic observation noise e^{xi,X 2 ) ~ AA(0, it|), ai = 0.5 V£. The GP 
models have separable anisotropic Matern-5/2 covariance functions with the specified hyperparam¬ 
eters, and fixed trend ti. Eigure 4 shows the corresponding classifier C. 

The sequential designs were initialized at Kq = 50 by generating 10 LHS samples from each 
Yi{x\,X 2 )] at each step the sampling locations were selected from a LHS candidate set T of size 
D = 100 using the randomized e-greedy method with e = 0.1. 

The top-left panel of Eigure 4 shows the estimated classifier after K = 500 samples in total 
using the Gap-SUR acquisition function, and the other panels display the locations of the sampled 
x’s as allocated for each ^ = 1,... ,5. As can be seen, the algorithm is highly discriminating in 
sampling jointly on A x £. At any given classification boundary, the algorithm effectively only 
sampled two out of the five responses, endogenously recovering the concept of Best-versus-Second 
Best testing. Thus, samples from Y^ are mostly located around the boundaries of surface and 
other surfaces. These contours, where — min^^^ = 0, are precisely the regions targeted 

by the Gap El metrics. Because Ci and C 5 have the longest boundaries, relatively more samples 
were chosen there {Di = 126,175 = 109); conversely the smallest set is C 4 which only received 
£>4 = 70 samples. 

Table 2 presents the relative performance of different acquisition functions. Specifically, we 
compare (i) uniform sampling; (ii) space-filling LHS sampling; (iii) concurrent strategy (4.5) 
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£=3 i=4 £=5 

Figure 4. 2-D Ranking on X = [—2,2] x [—2,2] using the Gap-SUR heuristic. Top-left panel: The solid black 
lines show the true C(xi,X 2 ), the dashed red lines show the estimated classifier C''^\x\,X 2 ) for K = 500. The other 
panels show the marginal designs (xi, for each of the 5 response surfaces. Shading indicates the estimated 

empirical gaps (xi, * 2 ), ^ = 1,..., 5. We observe that most samples gravitate towards regions where Ae ~ 0. Solid 
curves indicate boundaries of the true classifier C{x\,X 2 ). 


which is analogous to entropy-based sampling; (iv) Gap-UCB, and (v) Gap-SUR. We note that 
with many surfaces, the key is not necessarily the budget allocation among the surfaces (here, 
with identical cr^, optimal Dis are roughly equal), but efficient placement of sample locations that 
are most appropriate for each surface. This effect can be observed by comparing a non-adaptive 
strategy (that is space-filling in both x and .^), to a concurrent T^®^* strategy (4.5) (that targets 
classification boundaries but is uniform in i), to a Gap-SUR/Gap-UCB strategy (that targets 
different parts of classification boundaries for different indices t). Each step in the above sequence 
generates substantial performance gains; it is expected to be even more pronounced when the 
observation noise is index- (or state-) dependent. 

5. Case Study in Epidemics Management. Our last example is based on control problems 
in the context of infectious epidemics [32, 34, 35, 39]. Gonsider the stochastic SIR model which 
is a compartmental state-space model that partitions a population pool into the three classes 
of Susceptible counts St, Infecteds It and Recovereds Rt. We assume a fixed population size 
M = St+h+Rt so that the state space is the two-dimensional simplex X = {{s,i) £ : s-|-i < M}. 

In a typical setting, M £ [10^, 10^], so that X is discrete but too large to be explicitly enumerated 
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Table 2 

True loss vs. empirical loss with for the 2-D example. For UCB heuristics the cooling schedule is of the 

form 7 fe = c-yiog A:. The error probability is ErrProb = Ave{l —p^^l (®)) over the test set. The vector Dt{500) lists 
the number of samples out of 500 total from Ye, £ = 


Method 

Emp Loss 

(SE) 

True Loss 

(SE) 

ErrProb 

Index Allocations De 

Uniform Sampling 

6.43E-3 

(4.64E-5) 

5.47E-3 

(2.39E-4) 

4.10% 

(100,100,100,100,100) 

Non-Adaptive LHS 

5.97E-3 

(2.31E-5) 

4.72E-3 

(1.97E-4) 

3.92% 

(100,100,100,100,100) 

Cone c = 0.5 

5.11E-3 

(1.93E-5) 

4.04E-3 

(1.50E-4) 

3.66% 

(100,100,100,100,100) 

Gap-SUR 

3.46E-3 

(1.32E-5) 

3.17E-3 

(1.29E-4) 

3.06% 

(126, 101, 94, 70, 109) 

Gap-UCB, c = 0.5 

3.41E-3 

(1.45E-5) 

2.97E-3 

Q.14E-4) 

3.05% 

Q29, 103, 104, 72, 92) 


(on the order of \A^\ ~ 10®). The dynamics of {St, It) are time-stationary and will be specified 
below in (5.3). 

The goal of the controller is to mitigate epidemic impact through timely intervention, such as 
social distancing measures that lower the infectivity rate by reducing individuals’ contact rates; 
mathematically this corresponds to modifying the dynamics of {St, It)- To conduct cost-benefit 
optimization, we introduce on the one hand epidemic costs, here taken to be proportional to the 
number of cumulative infecteds, and on the other hand intervention costs, that are proportional to 
the current number of remaining susceptibles St. Intervention protocol can then be (myopically) 
optimized by comparing the expected cost of no-action /io(s,i) (conditional on the present state 
{s,i)) against the expected cost of immediate action, fiA{s,i). More precisely, let 

(5.1) ftQ{s,i) := ¥P[Sq — St\Io = i. So = s] and 

(5.2) TLA{s,i) :=E^[So-ST\Io = i,So = s]+C^s. 

Above, T = inf{t : It = 0} is the random end date of the outbreak; due to the fixed population 
and posited immunity from disease after being infected, the epidemic is guaranteed to have a finite 
lifetime. The difference Sq — St thus precisely measures the total number of original susceptibles 
who got infected at some point during the outbreak. 

The overall goal is then to rank po and ft a, with the intervention region corresponding to 
{(s, i) : fa{s, i) > fto{s, *)}• Because no analytic formulas are available for m's, a sensible procedure 
(also preferred due to the ease of handling numerous extensions of SIR models) is a Monte Carlo 
sampler that given an initial condition So = s, Iq = i and regime i G {0, A} generates a trajectory 
(5t,It)(a;) and uses it to evaluate the pathwise 5r(u;), connecting to the framework of (1.1). 

From the policy perspective, the trade-off in (5.1)-(5.2) revolves around doing nothing and 
letting the outbreak run its course, which carries a unit cost for each individual that is eventually 
infected, or implementing preventive social distancing measures which costs for each susceptible, 
but lowers the expected number of future infecteds. Typical countermeasures might be public ad 
campaigns, school closures, or distribution of prophylactic agents. In general, intervention is needed 
as soon as there is a threat of a big enough outbreak. However, if It is low, the cost of intervention 
is too high relative to its benefit because the epidemic might end on its own. Similarly, if St is 
low, the susceptible pool is naturally exhausted, again making intervention irrelevant (due to being 
“too late”). Quantifying these scenarios requires a precise probabilistic model. 

The dynamics of {St, It) under the respective laws P® and follow continuous-time Markov 
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chains with the following two transition channels: 

Infection :S + I ^ 21 with rate [3^Stit/M, j = 0, A; 

Recovery :I ^ R with rate yR. 

Above, (3^ < j3^ is interpreted as lowered contact rate among Infecteds and Susceptibles in the in¬ 
tervention regime, which thereby reduces outbreak growth and impact. The Markov chain {St, It) 
described in (5.3) is readily simulatable using the Gillespie time-stepping algorithm [19], utilizing 
the fact that the sojourn times between state transitions have (state-dependent) Exponential dis¬ 
tributions, and are independent of the next transition type. These simulations are however rather 
time-consuming, requiring 0{M) Uniform draws. Consequently, efficient ranking of expected costs 
is important in applications. 

Remark 5.1. Since (5.3) implies that eaeh individual infected period has an independent Exp{'y) 
distribution it follows that E[S'o — St] = 7 lE[/g Rdt], so that (5.1) can also be interpreted as 
proportional to total expeeted infected-days. 

We note that in this example the input space X is discrete, which however requires minimal 
changes to our implementation of Algorithm 1 . The biggest adjustment is the fact that the noise 
variances cr|(x) in (1.2) are unknown. Knowledge of cr|(x)’s is crucial for training the GP covariance 
kernel ICi, see e.g. (2.9). Indeed, while it is possible to simultaneously train JCi and a constant 
observation noise a (the latter is known as the “nugget” in GP literature, and can be inferred 
via maximum likelihood), with state-dependent noise JC is not identifiable. We resolve this issue 
through a batching procedure (compare to [3, Sec 3.1]) to estimate cr|(x) on-the-go. Namely, we 
re-use the same site x = (s, i) r-times, to obtain independent samples {x), ..., yP{x) from the 
corresponding Yi{x). This allows to estimate the conditional variance 

- y£(x))^ where yi{x) = {x) 

i=\ i=\ 

is the sample mean. Moreover, as shown in [40, Sec 4.4.2] we can treat the r samples at x as the single 
design entry (x, ye{x)) with noise variance 'a\{x)lr. The resulting reduction in post-averaged design 
size by a factor of r offers substantial computational speed-up in fitting and updating the kriging 
model. Formally, the El step in Algorithm 1 is replaced with using {x^~^^, £fc+ 2 ) ^ 

... = (x^"*"”, and re-computing the El score once every r ground-level iterations. 

For our study we set M = 2000, (3^ = 0.75,= 0.5, 7 = 0.5 with intervention cost of 

= 0.25 per susceptible. Figure 5 shows the resulting decision boundary dC. In the dark region 
the relative cost of intervention is lower, and hence action is preferred. For example, starting at 
Iq = 10, = 1800, without any action the outbreak would affect more than 40% of the susceptible 

population (expected cost of about 800), while under social distancing the impact would be about 
60 infecteds (leading to much lower total expected cost of 60 -|- Sq ~ 510). In the light region, 
wait-and-see approach has lower expected costs. For example at Iq = 50, So = 1400, the expected 
number of new infecteds without any action is 385 while the cost of countermeasures is bigger at 
0.25 X 1400 -|- 102 = 452. Overall, Figure 5 shows that the optimal decision is very sensitive to 
the current number of susceptibles Sq. This feature is due to the fact that outbreaks are created 
when the infection rate dominates the recovery (reproductive ratio TZo := {/3^/'y){So/M) above 1). 
Hence, for a pool with more than 85% susceptibles (5o > 1700), the initial growth rate satisfies 
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Figure 5. Fitted response boundary dC for the epidemic response example using the Gap-SUR expected improve¬ 
ment metric. The scatterplot indicates the design for K = 200; triangles indicate the initial design Z^^°\ and 
circles the adaptively placed {s,i)^°'^ (green: Yq; yellow: Ya). 


P^SofM > 7 and is likely to trigger an outbreak. However, as S is lowered, the region where 
P^Sq/M ~ 7 is approached, which makes social distancing unnecessary, as outbreak likelihood and 
severity diminishes. In particular. Figure 5 shows than no action is undertaken for So < 1350. In 
the intermediate region, there is a nontrivial classifier boundary for determining C(s,i). 

Figure 5 was generated by building an adaptive design using the Gap-SUR acquisition function 
and a total of iF = 200 design sites, with r = 100 batched samples at each site. The input space 
was restricted to T" = {s G {1200,..., 1800}, i G {0,200}}. The initial design Z(^o) included 
50 = 25 X 2 sites on the same rectangular 5x5 lattice for each of YqjYa- In this example, the noise 
levels are highly state-dependent, see Figure 6. The fio surface has much higher noise, with 

largest cjQ(s,ii) for (s,i) 5^; (1800,5)), whereas fa has largest noise in the top right corner. As a 
result, contains mostly samples from Yq and is denser towards the bottom of the Figure. 

6. Conclusion. In this article we have constructed several efficient sequential design strategies 
for the problem of determining the minimum among L > 2 response surfaces. Our Gap-SUR 
heuristic connects (1.1) to contour-finding and Bayesian optimization, providing a new application 
of the stepwise uncertainty reduction framework [10]. Our Gap-UCB heuristic mimics multi-armed 
bandits by treating all possible sampling pairs in A x £ as arms, and trying to balance arm 
exploration and exploitation. 

Our approach is based on the kriging framework, but this is primarily for convenience and is not 
crucial. To this end, instead of a Bayesian formulation, one could use a maximum-likelihood method 
to fit Fi{-), replacing the posterior Mg{x) with the point estimator and its standard error. Hence, 
many other regression frameworks could be selected. However, computational efficiency and the 
sequential framework place several efficiency restrictions on possible ways to modeling Fi{-). On the 
one hand, we need strong consistency, i.e. the convergence of the respective classifier -A C as 
K -A oo. In particular, the regression method must be nonparametric and localized. On the other 
hand, we wish for a sequential procedure that allows for efficient updating rules in moving from 
to Lastly, in practical settings further challenges such as heteroscedasticity, non-Gaussian 
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Figure 6. Estimated noise standard deviations ai{s,i) for the epidemic response example in the no¬ 
countermeasures (left panel, I = Q) and action (right panel, I = A) regimes. Note the different color scales of 
the two panels, with (to(') S> crA(') for all {s,i). 


samplers Yi, and heterogenous structure of the response surface are important. 

One suitable alternative to GP’s is local regression or Loess [44], which is a nonparametric 
regression framework that fits pointwise linear regression models for Loess is efficient and 

well-suited for heteroscedastic contexts with unknown noise distributions as in Section 5. It also au¬ 
tomatically generates the posterior mean and variance of the fit (allowing to use the derived formulas 
based on and 5(,{x)). However, Loess is not updatable, creating computational bottlenecks 

if many design augmentation iterations are to be used. At the same time htting is extremely fast, 
so depending on the implementation it might still be competitive with more sophisticated meth¬ 
ods. In this spirit, piecewise linear regression (which hrst partitions X into several cells and then 
carries out least-squares regression in each cell) is updatable via the Sherman-Morrison-Woodbury 
formulas and could be employed if there is a clear partitioning strategy available. 

We further note that GP kriging is just a convenient interim surrogate for building the experi¬ 
mental design. Consequently, once Z is generated, one could switch to a different response surface 
model to build a final estimate of the //^’s and hence C. For example, the treed GP approach [24] 
allows for a higher-fidelity fit for the response surfaces when the underlying smoothness (specihed 
by the covariance kernel) strongly varies across X. Because treed GP models are expensive to fit, 
one could compromise by using vanilla GP during DoE and treed GP for the final estimate of C. 

Another fruitful extension would be to investigate ranking algorithms in the fixed confidence 
setting. As presented, the sequential ranking algorithm is in the fixed budget setting, augmenting 
the design until a pre-specified size K. Practically, it is often desirable to prescribe adaptive, data- 
driven termination by targeting a pre-set conhdence level. A good termination criterion should take 
both accuracy and efficiency into account, ensuring the accuracy of (ie{x) and also anticipating low 
information gain from further sampling steps. One proposed termination criterion is to keep track 
of the evolution of the empirical loss and terminate once £:£(C(0) _ ££(cik+i)) is small 
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enough. This is equivalent to minimizing ■= + efc, where e > 0 is a parameter for cost 

of simulations; the more we care about efficiency, the larger the e is. When the design size k is 
small, the first term will dominate, so Lj. is expected to first decrease in k. As fe —?• oo, the rate 
of improvement in the loss function shrinks so that eventually Lj. will be increasing. However, we 
find that £C{C^^^) is quite noisy, especially if the kriging models are re-trained across stages. In 
that sense, the termination criterion needs to be robust enough to generate sufficiently strong (ad 
hoc) guarantees that a certain tolerance threshold has truly been achieved. 
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